Prepared by: Kulwinder Kaur
In this notebook, I am working on US mass shooting data taken from Kaggle. The data spans 1966 to 2017, roughly 50 years.
The US has witnessed 398 mass shootings in the last 50 years, resulting in 1,996 deaths and 2,488 injured. The most recent and deadliest, on October 2, 2017, killed 58 and injured 515. The number of people injured in that single attack exceeds the number injured in all mass shootings of 2015 and 2016 combined.
Dataset: The dataset contains detailed information on 398 mass shootings in the United States of America that killed 1,996 and injured 2,488 people.
Variables: The dataset contains Serial No, Title, Location, Date, Summary, Fatalities, Injured, Total Victims, Mental Health Issue, Race, Gender, and Lat-Long information.
Goal: The purpose of the notebook is to clean the data and prepare a Dash app to see the trend, the geographical impact, the impact in terms of lost lives, and the shooter's health condition and demographic information. I also aim to prepare different kinds of Plotly charts to get a broader view of the data.
#importing the necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
#reading the data from csv file
df=pd.read_csv('Mass Shootings Dataset Ver 5.csv',encoding='latin-1', index_col='S#')
#checking the head
df.head()
| Title | Location | Date | Incident Area | Open/Close Location | Target | Cause | Summary | Fatalities | Injured | Total victims | Policeman Killed | Age | Employeed (Y/N) | Employed at | Mental Health Issues | Race | Gender | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S# | ||||||||||||||||||||
| 1 | Texas church mass shooting | Sutherland Springs, TX | 11/5/2017 | Church | Close | random | unknown | Devin Patrick Kelley, 26, an ex-air force offi... | 26 | 20 | 46 | 0.0 | 26 | NaN | NaN | No | White | M | NaN | NaN |
| 2 | Walmart shooting in suburban Denver | Thornton, CO | 11/1/2017 | Wal-Mart | Open | random | unknown | Scott Allen Ostrem, 47, walked into a Walmart ... | 3 | 0 | 3 | 0.0 | 47 | NaN | NaN | No | White | M | NaN | NaN |
| 3 | Edgewood businees park shooting | Edgewood, MD | 10/18/2017 | Remodeling Store | Close | coworkers | unknown | Radee Labeeb Prince, 37, fatally shot three pe... | 3 | 3 | 6 | 0.0 | 37 | NaN | Advance Granite Store | No | Black | M | NaN | NaN |
| 4 | Las Vegas Strip mass shooting | Las Vegas, NV | 10/1/2017 | Las Vegas Strip Concert outside Mandala Bay | Open | random | unknown | Stephen Craig Paddock, opened fire from the 32... | 59 | 527 | 585 | 1.0 | 64 | NaN | NaN | Unclear | White | M | 36.181271 | -115.134132 |
| 5 | San Francisco UPS shooting | San Francisco, CA | 6/14/2017 | UPS facility | Close | coworkers | NaN | Jimmy Lam, 38, fatally shot three coworkers an... | 3 | 2 | 5 | 0.0 | 38 | 1.0 | NaN | Yes | Asian | M | NaN | NaN |
#checking the shape of the dataframe
df.shape
(323, 20)
#printing out rows and columns
print(f" There are {df.shape[0]} data rows with {df.shape[1]} features." )
There are 323 data rows with 20 features.
#checking the info
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 323 entries, 1 to 323
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Title                 323 non-null    object
 1   Location              278 non-null    object
 2   Date                  323 non-null    object
 3   Incident Area         261 non-null    object
 4   Open/Close Location   295 non-null    object
 5   Target                318 non-null    object
 6   Cause                 246 non-null    object
 7   Summary               323 non-null    object
 8   Fatalities            323 non-null    int64
 9   Injured               323 non-null    int64
 10  Total victims         323 non-null    int64
 11  Policeman Killed      317 non-null    float64
 12  Age                   179 non-null    object
 13  Employeed (Y/N)       67 non-null     float64
 14  Employed at           41 non-null     object
 15  Mental Health Issues  323 non-null    object
 16  Race                  321 non-null    object
 17  Gender                323 non-null    object
 18  Latitude              303 non-null    float64
 19  Longitude             303 non-null    float64
dtypes: float64(4), int64(3), object(13)
memory usage: 53.0+ KB
#separating out numerical and categorical columns
df_numeric=df.select_dtypes(include='number')
df_categorical=df.select_dtypes(exclude='number')
#summary statistics of the numerical dataset
df_numeric.describe()
| Fatalities | Injured | Total victims | Policeman Killed | Employeed (Y/N) | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|
| count | 323.000000 | 323.000000 | 323.000000 | 317.000000 | 67.000000 | 303.000000 | 303.000000 |
| mean | 4.436533 | 6.176471 | 10.263158 | 0.129338 | 0.626866 | 37.225076 | -94.429539 |
| std | 5.783208 | 29.889182 | 33.662309 | 0.610294 | 0.487288 | 5.536365 | 16.513296 |
| min | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 21.325512 | -161.792752 |
| 25% | 1.000000 | 1.000000 | 4.000000 | 0.000000 | 0.000000 | 33.571459 | -110.205485 |
| 50% | 3.000000 | 3.000000 | 5.000000 | 0.000000 | 1.000000 | 36.443290 | -88.122998 |
| 75% | 5.500000 | 5.000000 | 9.000000 | 0.000000 | 1.000000 | 41.483844 | -81.703237 |
| max | 59.000000 | 527.000000 | 585.000000 | 5.000000 | 1.000000 | 60.790539 | -69.707823 |
Observations:
#taking a look at the categorical columns
df_categorical.sample(5)
| Title | Location | Date | Incident Area | Open/Close Location | Target | Cause | Summary | Age | Employed at | Mental Health Issues | Race | Gender | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S# | |||||||||||||
| 79 | Residence in Lakeland, Florida | NaN | 1/6/2016 | Home | Close | drug dealer | NaN | Two men arrived at the house of a drug dealer ... | NaN | NaN | Unknown | Black American or African American | Male |
| 52 | Excel Industries mass shooting | Hesston, Kansas | 2/25/2016 | workplace | Close | random | NaN | Cedric L. Ford, who worked as a painter at a m... | NaN | manufacturing company | Unclear | Black | M |
| 33 | Plantation, Florida | Plantation, Florida | 3/19/2016 | Home | Close | party guests | NaN | A Spring Break party in a house packed with 40... | NaN | NaN | Unknown | Unknown | Unknown |
| 123 | Daytona Beach | Daytona Beach, Florida | 4/3/2015 | Home | Close | random | NaN | On April 3, 2015, four people were shot early ... | NaN | NaN | No | Black American or African American | Male |
| 61 | Kalamazoo Uber Driver Spree Killing | NaN | 2/20/2016 | NaN | Open | random | NaN | An Uber driver fired upon people in three loca... | NaN | uber | No | White American or European American | Male |
Observations:
#checking null values
df.isna().mean()*100
Title                    0.000000
Location                13.931889
Date                     0.000000
Incident Area           19.195046
Open/Close Location      8.668731
Target                   1.547988
Cause                   23.839009
Summary                  0.000000
Fatalities               0.000000
Injured                  0.000000
Total victims            0.000000
Policeman Killed         1.857585
Age                     44.582043
Employeed (Y/N)         79.256966
Employed at             87.306502
Mental Health Issues     0.000000
Race                     0.619195
Gender                   0.000000
Latitude                 6.191950
Longitude                6.191950
dtype: float64
The Employed at column has ~87% of its values missing, followed by Employeed (Y/N) at ~79% and Age at ~45%.
The Latitude and Longitude columns are always missing together, since both have a missing-value percentage of 6.2%. We also have the Location column, which can be used to fill these gaps: we can look up the coordinates of the corresponding Location on Google and use them to fill in the missing latitude and longitude values.
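As a sketch of that fill strategy (toy frame and a made-up coordinate dictionary, not the real dataset), the looked-up values can be keyed by the Location string and applied with `fillna` so existing coordinates are left untouched:

```python
import numpy as np
import pandas as pd

# toy frame mimicking the relevant columns (values are illustrative)
df = pd.DataFrame({
    'Location': ['Sutherland Springs, TX', 'Thornton, CO', 'Las Vegas, NV'],
    'Latitude': [np.nan, np.nan, 36.181271],
    'Longitude': [np.nan, np.nan, -115.134132],
})

# coordinates looked up manually, keyed by the Location string
coords = {'Sutherland Springs, TX': (29.2732, -98.0564),
          'Thornton, CO': (39.903057, -104.954445)}

# map each Location to its coordinates; fillna only touches the gaps
df['Latitude'] = df['Latitude'].fillna(
    df['Location'].map(lambda loc: coords.get(loc, (np.nan, np.nan))[0]))
df['Longitude'] = df['Longitude'].fillna(
    df['Location'].map(lambda loc: coords.get(loc, (np.nan, np.nan))[1]))
```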
#cross-checking the rows with missing coordinates
df[df['Latitude'].isnull()][['Location', 'Incident Area', 'Latitude','Longitude']]
| Location | Incident Area | Latitude | Longitude | |
|---|---|---|---|---|
| S# | ||||
| 1 | Sutherland Springs, TX | Church | NaN | NaN |
| 2 | Thornton, CO | Wal-Mart | NaN | NaN |
| 3 | Edgewood, MD | Remodeling Store | NaN | NaN |
| 5 | San Francisco, CA | UPS facility | NaN | NaN |
| 6 | Tunkhannock, PA | Weis grocery | NaN | NaN |
| 7 | Orlando, Florida | manufacturer Fiamma Inc. | NaN | NaN |
| 8 | Kirkersville, Ohio | a nursing home | NaN | NaN |
| 9 | Fresno, California | a street in downtown | NaN | NaN |
| 10 | Fort Lauderdale, Florida | baggage claim area of the airport | NaN | NaN |
| 11 | Burlington, WA | cosmetics section of a Macy's department store | NaN | NaN |
| 12 | Baton Rouge, LA | NaN | NaN | NaN |
| 13 | Dallas, Texas | at protest | NaN | NaN |
| 14 | Orlando, Florida | at nightclub | NaN | NaN |
| 52 | Hesston, Kansas | workplace | NaN | NaN |
| 58 | Kalamazoo County, Michigan | restaurant parking | NaN | NaN |
| 81 | San Bernardino, California | Christmas Party | NaN | NaN |
| 83 | Colorado Springs, Colorado | health clinic | NaN | NaN |
| 109 | Menasha, Wisconsin | NaN | NaN | NaN |
| 155 | Santa Barbara, California | NaN | NaN | NaN |
| 157 | Fort Hood, Texas | Fort Hood Army Post | NaN | NaN |
#latitude and longitude values from google
nan_coordinates= {1: [29.2732, -98.0564],
2: [39.903057, -104.954445],
3: [39.418719, -76.294402],
5: [37.828724, -122.355537],
6: [41.5386878, -75.94658800000002],
7: [28.5383355, -81.3792365],
8: [39.95886, -82.5965 ],
9: [36.746841, -119.772591],
10: [26.0702, -80.1517],
11: [48.4756621, -122.3254375],
12: [30.471165, -91.147385],
13: [32.779167, -96.808891],
14: [28.538336, -81.379234],
52: [38.138344, -97.431427],
58: [42.299152, -85.591736],
81: [34.115784, -117.302399],
83: [38.859055, -104.813499],
109: [44.2022, -88.4465],
155: [34.420830, -119.698189],
157: [31.195001, -97.741314]}
#filling missing values
for idx, (lat, lon) in nan_coordinates.items():
    df.loc[idx, 'Latitude'] = lat
    df.loc[idx, 'Longitude'] = lon
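The same fill can be done without an explicit loop: build a small frame from the dictionary and let `DataFrame.update` align on the `S#` index (toy frame below for illustration):

```python
import numpy as np
import pandas as pd

# stand-in for df, indexed by S# like the notebook's dataframe
df = pd.DataFrame({'Latitude': [np.nan, 36.181271],
                   'Longitude': [np.nan, -115.134132]},
                  index=[1, 4])
nan_coordinates = {1: [29.2732, -98.0564]}

# one row per S# to fix; update() overwrites df in place where indexes align
fixes = pd.DataFrame.from_dict(nan_coordinates, orient='index',
                               columns=['Latitude', 'Longitude'])
df.update(fixes)
```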
#looking at missing location and corresponding title, lat and lon columns
df[df['Location'].isnull()][['Location', 'Title', 'Latitude','Longitude']]
| Location | Title | Latitude | Longitude | |
|---|---|---|---|---|
| S# | ||||
| 16 | NaN | Forestville, Maryland Drive-by | 38.845113 | -76.874972 |
| 17 | NaN | Halifax County, VA | 36.765971 | -78.928344 |
| 18 | NaN | Tire-Slashing revenge escalation | 39.290385 | -76.612189 |
| 19 | NaN | Chicago Rap video Shootout | 41.878114 | -87.629798 |
| 20 | NaN | Texas family murder-suicide | 29.785785 | -95.824396 |
| 21 | NaN | Alabama highway random shooting | 34.162040 | -86.475543 |
| 22 | NaN | Long Beach Street murder | 33.770050 | -118.193739 |
| 23 | NaN | Albuquerque, NM House party shooting | 35.085334 | -106.605553 |
| 24 | NaN | Memphis, TN gas station shooting | 35.149534 | -90.048980 |
| 25 | NaN | Chicago Birthday Party Bus Shooting | 41.878114 | -87.629798 |
| 26 | NaN | Albuquerque, NM Family restaurant shooting | 35.085334 | -106.605553 |
| 30 | NaN | Greenhill, AL Family murder-suicide | 34.759257 | -86.602493 |
| 35 | NaN | Atlanta Nightclub shooting | 33.748995 | -84.387982 |
| 36 | NaN | Florida Family celebration ends in shooting | 26.640628 | -81.872308 |
| 37 | NaN | Elberton, Georgia murder-suicide | 34.111223 | -82.867084 |
| 38 | NaN | Trenton, NJ sidewalk shooting | 40.217053 | -74.742938 |
| 39 | NaN | Detroit Drive-by | 42.331427 | -83.045754 |
| 41 | NaN | Wilkinsburg BBQ cookout shootout | 40.441736 | -79.881994 |
| 43 | NaN | Lafayette, LA drive by | 30.224090 | -92.019843 |
| 44 | NaN | Kansas City Home Invasion | 39.114053 | -94.627464 |
| 46 | NaN | Roswell, GA Hookah shootout | 34.023243 | -84.361556 |
| 47 | NaN | Wichita Nightclub shooting | 37.687176 | -97.330053 |
| 48 | NaN | Detroit Strip Club Shootout | 42.331427 | -83.045754 |
| 49 | NaN | Riverside restaurant shooting | 33.997197 | -117.485480 |
| 54 | NaN | Belfair, WA Family murder-suicide | 47.451459 | -122.826946 |
| 56 | NaN | Missouri Highway shooting | 38.771440 | -90.370949 |
| 57 | NaN | Houston Drive-by | 29.760427 | -95.369803 |
| 60 | NaN | Florida Hookah Bar Shooting | 27.950575 | -82.457178 |
| 61 | NaN | Kalamazoo Uber Driver Spree Killing | 42.291707 | -85.587229 |
| 62 | NaN | Vallejo, CA car shooting | 38.104086 | -122.256637 |
| 63 | NaN | Michigan school shooting | 43.201126 | -86.238946 |
| 65 | NaN | Nightclub fight turns into shootout | 43.161030 | -77.610922 |
| 66 | NaN | Tampa, FL Strip club shooting | 27.950575 | -82.457178 |
| 67 | NaN | Los Angeles, CA street shooting | 34.052234 | -118.243685 |
| 68 | NaN | Texas teen commits family murder suicide | 29.209684 | -99.786168 |
| 69 | NaN | New Orleans gas station shoot-out | 29.951066 | -90.071532 |
| 70 | NaN | Glendale, AZ House Party shooting | 33.538652 | -112.185987 |
| 71 | NaN | Caroline County, VA Family Murder Suicide | 38.044556 | -77.286488 |
| 73 | NaN | Perris, CA Mexican restaurant shooting | 33.782519 | -117.228648 |
| 74 | NaN | Florida family murder | 30.754181 | -86.572799 |
| 75 | NaN | Los Angeles Drive-by | 34.052234 | -118.243685 |
| 76 | NaN | Gloucester County, VA, House Party | 37.402640 | -76.458559 |
| 77 | NaN | Wilmington, DE robbery | 39.739072 | -75.539788 |
| 78 | NaN | Roadside in Memphis, Tennessee | 35.149534 | -90.048980 |
| 79 | NaN | Residence in Lakeland, Florida | 28.039465 | -81.949804 |
We can see that the Title column carries the city and state information. We can split the value on the comma and separate out the state part.
#splitting title with space just to see the values that are returned
df[df['Location'].isnull()]['Title'].str.split(' ')
S# 16 [Forestville,, Maryland, Drive-by] 17 [Halifax, County,, VA] 18 [Tire-Slashing, revenge, escalation] 19 [Chicago, Rap, video, Shootout] 20 [Texas, family, murder-suicide] 21 [Alabama, highway, random, shooting] 22 [Long, Beach, Street, murder] 23 [Albuquerque,, NM, House, party, shooting] 24 [Memphis,, TN, gas, station, shooting] 25 [Chicago, Birthday, Party, Bus, Shooting] 26 [Albuquerque,, NM, Family, restaurant, shooting] 30 [Greenhill,, AL, Family, murder-suicide] 35 [Atlanta, Nightclub, shooting] 36 [Florida, Family, celebration, ends, in, shoot... 37 [Elberton,, Georgia, murder-suicide] 38 [Trenton,, NJ, sidewalk, shooting] 39 [Detroit, Drive-by] 41 [Wilkinsburg, BBQ, cookout, shootout] 43 [Lafayette,, LA, drive, by] 44 [Kansas, City, Home, Invasion] 46 [Roswell,, GA, Hookah, shootout] 47 [Wichita, Nightclub, shooting] 48 [Detroit, Strip, Club, Shootout] 49 [Riverside, restaurant, shooting] 54 [Belfair,, WA, Family, murder-suicide] 56 [Missouri, Highway, shooting] 57 [Houston, Drive-by] 60 [Florida, Hookah, Bar, Shooting] 61 [Kalamazoo, Uber, Driver, Spree, Killing] 62 [Vallejo,, CA, car, shooting] 63 [Michigan, school, shooting] 65 [Nightclub, fight, turns, into, shootout] 66 [Tampa,, FL, Strip, club, shooting] 67 [Los, Angeles,, CA, street, shooting] 68 [Texas, teen, commits, family, murder, suicide] 69 [New, Orleans, gas, station, shoot-out] 70 [Glendale,, AZ, House, Party, shooting] 71 [Caroline, County,, VA, Family, Murder, Suicide] 73 [Perris,, CA, Mexican, restaurant, shooting] 74 [Florida, family, murder] 75 [Los, Angeles, Drive-by] 76 [Gloucester, County,, VA,, House, Party] 77 [Wilmington,, DE, robbery] 78 [Roadside, in, Memphis,, Tennessee] 79 [Residence, in, Lakeland,, Florida] Name: Title, dtype: object
#we can extract the split title values into a separate series.
# splitting the string by comma and keeping one entry before and after comma
# so that we can get the city and state values.
cf=df[df['Location'].isnull()]['Title'].apply(lambda x: x.split(',')[0].split(' ')[-1] + ', '+ x.split(',')[1].split(' ')[1] if ',' in x else x.split(' ')[:2])
cf
S# 16 Forestville, Maryland 17 County, VA 18 [Tire-Slashing, revenge] 19 [Chicago, Rap] 20 [Texas, family] 21 [Alabama, highway] 22 [Long, Beach] 23 Albuquerque, NM 24 Memphis, TN 25 [Chicago, Birthday] 26 Albuquerque, NM 30 Greenhill, AL 35 [Atlanta, Nightclub] 36 [Florida, Family] 37 Elberton, Georgia 38 Trenton, NJ 39 [Detroit, Drive-by] 41 [Wilkinsburg, BBQ] 43 Lafayette, LA 44 [Kansas, City] 46 Roswell, GA 47 [Wichita, Nightclub] 48 [Detroit, Strip] 49 [Riverside, restaurant] 54 Belfair, WA 56 [Missouri, Highway] 57 [Houston, Drive-by] 60 [Florida, Hookah] 61 [Kalamazoo, Uber] 62 Vallejo, CA 63 [Michigan, school] 65 [Nightclub, fight] 66 Tampa, FL 67 Angeles, CA 68 [Texas, teen] 69 [New, Orleans] 70 Glendale, AZ 71 County, VA 73 Perris, CA 74 [Florida, family] 75 [Los, Angeles] 76 County, VA 77 Wilmington, DE 78 Memphis, Tennessee 79 Lakeland, Florida Name: Title, dtype: object
Some entries come back correctly as "city, state" strings; however, rows whose title contains no comma are returned as word lists instead.
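An alternative to the lambda above is a single regex via `str.extract`, which returns NaN for titles without a comma instead of a word list (toy titles taken from the output above):

```python
import pandas as pd

titles = pd.Series(['Albuquerque, NM House party shooting',
                    'Memphis, TN gas station shooting',
                    'Chicago Rap video Shootout'])

# capture "City, State"; rows with no comma simply fail to match -> NaN
parts = titles.str.extract(r'^(?P<City>[^,]+),\s*(?P<State>\w+)')
```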
df.iloc[48,:]
Title Riverside restaurant shooting Location NaN Date 2/28/2016 Incident Area restaurant parking Open/Close Location Open Target random Cause NaN Summary Shooter fired upon group of people in a restau... Fatalities 1 Injured 3 Total victims 4 Policeman Killed 0.0 Age NaN Employeed (Y/N) NaN Employed at NaN Mental Health Issues Unknown Race Some other race Gender Male Latitude 33.997197 Longitude -117.48548 Name: 49, dtype: object
#by visually checking the second element in cf title items and determining the state name
#replacing the second element with the state abbreviation
#for indexes 18 and 49 the state was determined from where the coordinates point on Google
fill_loc={18: 'IL',
19: 'IL',
20: 'TX',
21: 'AL',
22: 'CA',
25: 'IL',
35: 'GA',
36: 'FL',
39: 'MI',
41: 'PA',
44: 'MO',
47: 'KS',
48: 'MI',
49: 'CA' ,
56: 'MO',
57: 'TX',
60: 'FL',
61: 'MI',
63: 'MI',
68: 'TX',
69: 'LA',
74: 'FL',
75: 'CA'}
#filling the second item in cf with the manually entered state abbreviations
for i, abb in fill_loc.items():
    cf[i][1] = abb
    cf[i] = ', '.join(cf[i])
#filled Cf location
cf
S# 16 Forestville, Maryland 17 County, VA 18 Tire-Slashing, IL 19 Chicago, IL 20 Texas, TX 21 Alabama, AL 22 Long, CA 23 Albuquerque, NM 24 Memphis, TN 25 Chicago, IL 26 Albuquerque, NM 30 Greenhill, AL 35 Atlanta, GA 36 Florida, FL 37 Elberton, Georgia 38 Trenton, NJ 39 Detroit, MI 41 Wilkinsburg, PA 43 Lafayette, LA 44 Kansas, MO 46 Roswell, GA 47 Wichita, KS 48 Detroit, MI 49 Riverside, CA 54 Belfair, WA 56 Missouri, MO 57 Houston, TX 60 Florida, FL 61 Kalamazoo, MI 62 Vallejo, CA 63 Michigan, MI 65 [Nightclub, fight] 66 Tampa, FL 67 Angeles, CA 68 Texas, TX 69 New, LA 70 Glendale, AZ 71 County, VA 73 Perris, CA 74 Florida, FL 75 Los, CA 76 County, VA 77 Wilmington, DE 78 Memphis, Tennessee 79 Lakeland, Florida Name: Title, dtype: object
#splitting the state part as separate column
cf= cf.str.split(pat=',', n=1, expand=True)
cf
| 0 | 1 | |
|---|---|---|
| S# | ||
| 16 | Forestville | Maryland |
| 17 | County | VA |
| 18 | Tire-Slashing | IL |
| 19 | Chicago | IL |
| 20 | Texas | TX |
| 21 | Alabama | AL |
| 22 | Long | CA |
| 23 | Albuquerque | NM |
| 24 | Memphis | TN |
| 25 | Chicago | IL |
| 26 | Albuquerque | NM |
| 30 | Greenhill | AL |
| 35 | Atlanta | GA |
| 36 | Florida | FL |
| 37 | Elberton | Georgia |
| 38 | Trenton | NJ |
| 39 | Detroit | MI |
| 41 | Wilkinsburg | PA |
| 43 | Lafayette | LA |
| 44 | Kansas | MO |
| 46 | Roswell | GA |
| 47 | Wichita | KS |
| 48 | Detroit | MI |
| 49 | Riverside | CA |
| 54 | Belfair | WA |
| 56 | Missouri | MO |
| 57 | Houston | TX |
| 60 | Florida | FL |
| 61 | Kalamazoo | MI |
| 62 | Vallejo | CA |
| 63 | Michigan | MI |
| 65 | NaN | NaN |
| 66 | Tampa | FL |
| 67 | Angeles | CA |
| 68 | Texas | TX |
| 69 | New | LA |
| 70 | Glendale | AZ |
| 71 | County | VA |
| 73 | Perris | CA |
| 74 | Florida | FL |
| 75 | Los | CA |
| 76 | County | VA |
| 77 | Wilmington | DE |
| 78 | Memphis | Tennessee |
| 79 | Lakeland | Florida |
We can also separate city and state out of the Location column for further analysis.
#joining the state part of the location to actual df
df=df.join(df['Location'].str.split(pat=',',n= 1,expand=True).rename(columns={0: 'City', 1:'State'}))
#replacing state names with the state names in cleaned CF state column
df.loc[df.index.isin(cf.index), "State"]=cf[1]
df.loc[df['State'].isna()]
| Title | Location | Date | Incident Area | Open/Close Location | Target | Cause | Summary | Fatalities | Injured | ... | Age | Employeed (Y/N) | Employed at | Mental Health Issues | Race | Gender | Latitude | Longitude | City | State | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S# | |||||||||||||||||||||
| 65 | Nightclub fight turns into shootout | NaN | 2/7/2016 | nightclub | Close | random | anger | Two groups of people fired at each other after... | 1 | 7 | ... | NaN | NaN | NaN | Unknown | Unknown | Unknown | 43.161030 | -77.610922 | NaN | NaN |
| 164 | Washington Navy Yard | Washington D.C. | 9/16/2013 | NaN | Close | random | terrorism | On September 16, 2013, a 34-yearl old contract... | 13 | 3 | ... | 34 | 1.0 | Navy Yard | Yes | Black American or African American | Male | 38.904809 | -77.016297 | Washington D.C. | None |
2 rows × 22 columns
#manually filling the two null values
df.loc[65, 'State']='NY'
df.loc[164, 'State']='Washington D.C.'
#creating a state abbreviation dictionary from google
states_abb= {
'AL': 'Alabama',
'KY': 'Kentucky',
'OH': 'Ohio',
'AK': 'Alaska',
'LA': 'Louisiana',
'OK': 'Oklahoma',
'AZ': 'Arizona',
'ME': 'Maine',
'OR': 'Oregon',
'AR': 'Arkansas',
'MD': 'Maryland',
'PA': 'Pennsylvania',
'AS': 'American Samoa',
'MA': 'Massachusetts',
'PR': 'Puerto Rico',
'CA': 'California',
'MI': 'Michigan',
'RI': 'Rhode Island',
'CO': 'Colorado',
'MN': 'Minnesota',
'SC': 'South Carolina',
'CT': 'Connecticut',
'MS': 'Mississippi',
'SD': 'South Dakota',
'DE': 'Delaware',
'MO': 'Missouri',
'TN': 'Tennessee',
'DC': 'District of Columbia',
'MT': 'Montana',
'TX': 'Texas',
'FL': 'Florida',
'NE': 'Nebraska',
'GA': 'Georgia',
'NV': 'Nevada',
'UT': 'Utah',
'GU': 'Guam',
'NH': 'New Hampshire',
'VT': 'Vermont',
'HI': 'Hawaii',
'NJ': 'New Jersey',
'VA': 'Virginia',
'ID': 'Idaho',
'NM': 'New Mexico',
'VI': 'Virgin Islands',
'IL': 'Illinois',
'NY': 'New York',
'WA': 'Washington',
'IN': 'Indiana',
'NC': 'North Carolina',
'WV': 'West Virginia',
'IA': 'Iowa',
'ND': 'North Dakota',
'WI': 'Wisconsin',
'KS': 'Kansas',
'MP': 'Northern Mariana Islands',
'WY': 'Wyoming',
'TT': 'Trust Territories'
}
#replacing the abbreviation by name
for i in states_abb.keys():
df.loc[df['State'].str.strip()==i, 'State']=states_abb[i]
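The loop above can also be written as one vectorized pass with `Series.replace`, which leaves values absent from the dictionary (already-full state names) unchanged. A sketch on a toy series with a subset of the dictionary:

```python
import pandas as pd

states_abb = {'TX': 'Texas', 'CO': 'Colorado'}  # subset for illustration
s = pd.Series([' TX', 'Colorado', 'CO'])

# strip stray whitespace once, then expand all abbreviations in one call
s = s.str.strip().replace(states_abb)
```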
df.isna().sum()
Title 0 Location 45 Date 0 Incident Area 62 Open/Close Location 28 Target 5 Cause 77 Summary 0 Fatalities 0 Injured 0 Total victims 0 Policeman Killed 6 Age 144 Employeed (Y/N) 256 Employed at 282 Mental Health Issues 0 Race 2 Gender 0 Latitude 0 Longitude 0 City 45 State 0 dtype: int64
#cross checking values
df['State']
S#
1 Texas
2 Colorado
3 Maryland
4 Nevada
5 California
...
319 Illinois
320 Louisiana
321 Washington
322 Arizona
323 Texas
Name: State, Length: 323, dtype: object
#crosschecking
df['State'].value_counts().index
Index([' California', ' Florida', ' Texas', ' Georgia', ' Washington',
' Arizona', ' North Carolina', ' Ohio', ' New York', ' Alabama',
' Wisconsin', ' Illinois', 'California', ' Colorado', ' Tennessee',
' Michigan', ' Pennsylvania', ' Virginia', ' South Carolina',
' Kentucky', ' Oklahoma', ' Nevada', ' Mississippi', ' Minnesota',
'Michigan', ' Louisiana', ' Massachusetts', ' Kansas', ' Oregon',
'Texas', 'Florida', ' New Jersey', 'Louisiana', ' Missouri', 'Virginia',
' Connecticut', 'Illinois', ' Nebraska', 'Washington', ' Arkansas',
' Montana', 'Pennsylvania', 'Alabama', ' Utah', 'New Mexico', 'Georgia',
'Missouri', ' New Mexico',
' Souderton, Lansdale, Harleysville, Pennsylvania',
' Lancaster, Pennsylvania', ' Vermont', ' Albuquerque, New Mexico',
'Washington D.C.', ' San Diego, California', ' Hawaii', ' Alaska',
' Wyoming', 'Maryland', ' West Virginia', 'New Jersey', ' Idaho',
'Nevada', ' Indiana', ' Maryland', ' South Dakota', ' Maine',
'Delaware', 'Arizona', 'New York', 'Tennessee', 'Colorado', ' Texas ',
' Virginia', 'Kansas', ' Iowa'],
dtype='object')
#it seems that some state names have a leading space
#stripping spaces on both sides
df['State']=df['State'].str.strip()
#rechecking
df['State'].value_counts().index
Index(['California', 'Florida', 'Texas', 'Georgia', 'Washington', 'Arizona',
'Illinois', 'New York', 'North Carolina', 'Alabama', 'Michigan', 'Ohio',
'Virginia', 'Wisconsin', 'Pennsylvania', 'Louisiana', 'Tennessee',
'Colorado', 'Nevada', 'South Carolina', 'Oklahoma', 'Kansas',
'Missouri', 'Kentucky', 'Massachusetts', 'Minnesota', 'Oregon',
'New Mexico', 'New Jersey', 'Mississippi', 'Nebraska', 'Connecticut',
'Montana', 'Maryland', 'Utah', 'Arkansas', 'Wyoming', 'Alaska',
'Hawaii', 'San Diego, California', 'Vermont', 'Lancaster, Pennsylvania',
'Delaware', 'Albuquerque, New Mexico', 'Washington D.C.',
'West Virginia', 'Souderton, Lansdale, Harleysville, Pennsylvania',
'Idaho', 'Indiana', 'South Dakota', 'Maine', 'Iowa'],
dtype='object')
Some State values still contain a city name followed by a comma and then the state name. We remove the city part below.
#removing the city name: split on the comma and keep the last piece,
#so multi-word states such as 'New Mexico' stay intact
mask = df['State'].str.contains(',')
df.loc[mask, 'State'] = df.loc[mask, 'State'].str.split(',').str[-1].str.strip()
df['State'].value_counts().index
Index(['California', 'Florida', 'Texas', 'Washington', 'Georgia', 'Arizona',
'Illinois', 'North Carolina', 'Alabama', 'New York', 'Pennsylvania',
'Ohio', 'Michigan', 'Virginia', 'Wisconsin', 'Colorado', 'Louisiana',
'Tennessee', 'Nevada', 'Missouri', 'South Carolina', 'Kansas',
'Oklahoma', 'Kentucky', 'Oregon', 'Minnesota', 'Mississippi',
'Massachusetts', 'New Jersey', 'New Mexico', 'Nebraska', 'Connecticut',
'Montana', 'Maryland', 'Utah', 'Arkansas', 'Wyoming', 'Alaska',
'Hawaii', 'Vermont', 'West Virginia', 'Mexico', 'Washington D.C.',
'Delaware', 'Idaho', 'Indiana', 'South Dakota', 'Maine', 'Iowa'],
dtype='object')
#doing a value count
df['State'].value_counts()
California 37 Florida 25 Texas 21 Washington 16 Georgia 16 Arizona 12 Illinois 12 North Carolina 11 Alabama 11 New York 11 Pennsylvania 10 Ohio 10 Michigan 10 Virginia 9 Wisconsin 9 Colorado 7 Louisiana 7 Tennessee 7 Nevada 6 Missouri 5 South Carolina 5 Kansas 5 Oklahoma 5 Kentucky 5 Oregon 4 Minnesota 4 Mississippi 4 Massachusetts 4 New Jersey 4 New Mexico 4 Nebraska 3 Connecticut 3 Montana 2 Maryland 2 Utah 2 Arkansas 2 Wyoming 1 Alaska 1 Hawaii 1 Vermont 1 West Virginia 1 Mexico 1 Washington D.C. 1 Delaware 1 Idaho 1 Indiana 1 South Dakota 1 Maine 1 Iowa 1 Name: State, dtype: int64
# we will also add a column with the state abbreviation so that it can be used for choropleth charts
# inverting the states_abb dictionary gives a name -> abbreviation lookup
abb_lookup = {name: abb for abb, name in states_abb.items()}
df['State_abb'] = df['State'].map(abb_lookup)
df['State_abb']
S#
1 TX
2 CO
3 MD
4 NV
5 CA
..
319 IL
320 LA
321 WA
322 AZ
323 TX
Name: State_abb, Length: 323, dtype: object
Since the Employed at and Employeed (Y/N) columns have a very large share of null values, we will drop them from the dataframe.
#dropping the columns in place
df.drop(columns=['Employed at', 'Employeed (Y/N)'], inplace=True)
#checking the summary column where age is null particularly row 18.
df[df['Age'].isna()][['Age', 'Summary']].loc[18,'Summary']
"Shooter was angry over fact that rival families had been slashing each other's cars. Shooter fired into a block party and injured four."
It appears that the summary gives no detail about the shooter's age.
#checking value counts
df['Age'].value_counts()
44 7 14 7 34 7 18 7 37 6 15 6 45 6 43 6 39 6 20 6 40 6 19 6 41 6 17 6 36 5 22 5 29 5 42 5 28 5 24 5 23 5 16 4 26 4 25 4 32 3 47 3 27 3 59 3 35 3 48 2 64 2 55 2 30 2 57 2 38 2 52 2 31 1 15,16 1 17,18 1 50 1 13,11 1 33 1 13 1 66 1 19,22 1 19,32 1 70 1 12 1 21 1 0 1 51 1 Name: Age, dtype: int64
#number of rows with missing values at Age column
df['Age'].isna().sum()
144
Some rows contain two age numbers. Assuming the first corresponds to the age of the suspect and the second to the age of a victim, we will remove the second number from this column.
df['Age']=df['Age'].str.split(',').str.get(0)
We will fill the missing ages with the sentinel '1000' so that the whole column can be converted from strings to integers.
df['Age']=df["Age"].fillna('1000')
#converting the age number from string to int
df["Age"]=df["Age"].astype('int')
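A sentinel-free alternative (a sketch, not what the notebook uses): pandas' nullable `Int64` dtype keeps missing ages as `<NA>` while still giving an integer column:

```python
import pandas as pd

age = pd.Series(['26', '47', None, '15,16'])

# keep the first number, coerce to numeric, and use the nullable integer
# dtype so missing ages stay <NA> instead of a 1000 sentinel
age = pd.to_numeric(age.str.split(',').str.get(0),
                    errors='coerce').astype('Int64')
```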
For analysis purposes, we define age groups using the buckets below:
def age_groups(age: int):
    if age == 1000:
        return 'Unknown'
    elif age <= 16:
        return 'Child'
    elif age <= 30:
        return 'Young Adults'
    elif age <= 45:
        return 'Middle Aged Adults'
    elif age <= 65:
        return 'Old Adults'
    return 'Elderly'
df['Age Group'] = df['Age'].apply(age_groups)
df['Age Group'].value_counts()
Unknown 144 Middle Aged Adults 70 Young Adults 67 Child 22 Old Adults 18 Elderly 2 Name: Age Group, dtype: int64
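The same buckets can be produced declaratively with `pd.cut` (a sketch; the upper bound of 120 is an assumption for 'Elderly'). The 1000 sentinel falls outside every bin and comes back as NaN, which we then label 'Unknown':

```python
import pandas as pd

ages = pd.Series([12, 25, 40, 50, 70, 1000])

# right-closed bins matching the buckets: (0,16] Child, (16,30] Young Adults, ...
groups = pd.cut(ages, bins=[0, 16, 30, 45, 65, 120],
                labels=['Child', 'Young Adults', 'Middle Aged Adults',
                        'Old Adults', 'Elderly'])

# the sentinel 1000 lies outside the bins -> NaN -> 'Unknown'
groups = groups.cat.add_categories('Unknown').fillna('Unknown')
```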
#rechecking missing value counts
df.isna().sum()
Title 0 Location 45 Date 0 Incident Area 62 Open/Close Location 28 Target 5 Cause 77 Summary 0 Fatalities 0 Injured 0 Total victims 0 Policeman Killed 6 Age 0 Mental Health Issues 0 Race 2 Gender 0 Latitude 0 Longitude 0 City 45 State 0 State_abb 2 Age Group 0 dtype: int64
df['Race'].value_counts()
White American or European American 122 Black American or African American 76 Unknown 42 Some other race 20 white 12 Asian American 11 White 9 Asian 6 Black 5 Latino 5 Native American or Alaska Native 3 black 3 Other 2 Two or more races 2 Black American or African American/Unknown 1 White American or European American/Some other Race 1 Asian American/Some other race 1 Name: Race, dtype: int64
df['Race']= df['Race'].fillna('Unknown')
There are some rows where Race has two values separated by '/'. We keep the first and remove the second.
df['Race']=df['Race'].str.split('/').str.get(0)
df.loc[df['Race']=='white', 'Race']='White'
df.loc[df['Race']=='White American or European American', 'Race']='White'
df.loc[df['Race']=='black', 'Race']='Black'
df.loc[df['Race']=='Black American or African American', 'Race']='Black'
df.loc[df['Race']=='Asian American', 'Race']='Asian'
df.loc[df['Race']=='Some other race', 'Race']='Other'
df.loc[df['Race']=='Two or more races', 'Race']='Other'
df['Race'].value_counts()
White 144 Black 85 Unknown 44 Other 24 Asian 18 Latino 5 Native American or Alaska Native 3 Name: Race, dtype: int64
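The chain of `.loc` assignments above can be condensed into one `replace` call over a normalization dictionary (same mappings, toy input):

```python
import pandas as pd

race = pd.Series(['white', 'White American or European American',
                  'Black American or African American/Unknown',
                  'Asian American'])

# keep the part before any '/', then collapse all variants in one pass
race = race.str.split('/').str.get(0).replace({
    'white': 'White',
    'White American or European American': 'White',
    'black': 'Black',
    'Black American or African American': 'Black',
    'Asian American': 'Asian',
    'Some other race': 'Other',
    'Two or more races': 'Other',
})
```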
df[df['Policeman Killed'].isna()]['Summary']
S# 6 Randy Stair, a 24-year-old worker at Weis groc... 7 John Robert Neumann, Jr., 45, a former employe... 9 Kori Ali Muhammad, 39, opened fire along a str... 10 Esteban Santiago, 26, flew from Alaska to Fort... 11 Arcan Cetin, 20, killed a teen girl and three ... 28 Three people were killed in a shooting at a ho... Name: Summary, dtype: object
After reading the summaries for the rows with a missing 'Policeman Killed' value, we found that no policeman was killed in any of them, so we fill the NaNs with 0.
df['Policeman Killed']=df['Policeman Killed'].fillna(0)
df['Cause'].value_counts()
psycho 68 terrorism 65 anger 44 frustration 18 domestic dispute 12 unemployement 10 revenge 10 racism 6 unknown 4 failing exams 3 domestic disputer 1 suspension 1 religious radicalism 1 drunk 1 breakup 1 robbery 1 Name: Cause, dtype: int64
#filling the NaNs with unknown
df['Cause']=df['Cause'].fillna('unknown')
df['Mental Health Issues'].value_counts()
Unknown 110 Yes 106 No 93 Unclear 13 unknown 1 Name: Mental Health Issues, dtype: int64
df.loc[df['Mental Health Issues']=='unknown', 'Mental Health Issues']='Unknown'
#### Gender
df['Gender'].value_counts()
Male 272 Unknown 21 M 20 Female 5 Male/Female 4 M/F 1 Name: Gender, dtype: int64
#mapping the correct names
df['Gender']=df['Gender'].map({'Female': 'Female',
"M": "Male",
'M/F': "Male/Female",
'Male/Female': 'Male/Female',
"Unknown": "Unknown",
"Male": "Male"})
df['Gender'].value_counts()
Male 292 Unknown 21 Male/Female 5 Female 5 Name: Gender, dtype: int64
df['Gender']= df['Gender'].fillna('Unknown')
#value counts on target
df['Target'].value_counts()
random 140 Family 37 Coworkers 17 Students 16 Students+Teachers 14 Ex-Coworkers 9 party guests 8 Ex-Wife 7 Policeman 7 coworkers 5 Teachers 5 neighbors 4 Ex-Wife & Family 3 Friends 2 school girls 2 Children 2 police 2 Ex-Girlfriend 2 Sikhs 1 black men 1 Congresswoman 1 Policeman+Council Member 1 postmaster 1 lawyers 1 basketball players 1 psychologist+psychiatrist 1 Ex-GirlFriend 1 hunters 1 Social Workers 1 Family+students 1 Students+Parents 1 prayer group 1 monks 1 Family+random 1 Coworker's Family 1 Girlfriend 1 women 1 uninvited guests 1 birthday party bus 1 Trooper 1 club members 1 Family/Neighbors 1 drug dealer 1 protestors 1 Marines 1 Ex-girlfriend 1 House Owner 1 Contestant 1 Ex-Girlfriend & Family 1 Ex-Girlfriend+random 1 rapper+random 1 TSA Officer 1 partner's family 1 welding shop employees 1 Name: Target, dtype: int64
df['Target']=df['Target'].fillna('Unknown')
#mapping the correct names to values
df['Target_group']= df['Target'].map({'random': 'Random',
'Ex-girlfriend': 'Ex-girlfriend',
'Ex-GirlFriend': 'Ex-girlfriend',
'Ex-Girlfriend': 'Ex-girlfriend',
'Ex-Girlfriend & Family': 'Ex-girlfriend',
'Ex-Girlfriend+random': 'Ex-girlfriend',
'Coworkers' : 'Coworkers',
'coworkers': 'Coworkers',
'Family': 'Family',
'Students': 'Students',
'party guests': 'Guests',
'uninvited guests': 'Guests',
'Students+Teachers': 'Students',
'women': 'Random',
'school girls': 'Students',
'welding shop employees': 'Employees',
'club members': 'Employees',
'birthday party bus': 'Guests',
'Ex-Coworkers': 'Ex-Coworkers',
'Ex-Wife': 'Ex-Wife',
'Policeman': 'Police',
"partner's family" : "Partner's Family",
"TSA Officer": "Government Officer",
"Trooper" : "Trooper",
"rapper+random" : "Rapper",
"Contestant": "Contestant",
"House Owner": "House Owner",
"Marines": "Marines",
"protestors": "Protestors",
"drug dealer": "Drug Dealer",
"Family/Neighbors" : "Family",
"police": "Police",
"Family+random": "Family",
"Policeman+Council Member": "Police",
"monks": "Religious Leader",
"black men": "Black men",
"Sikhs": "Sikhs",
"psychologist+psychiatrist": "Psychiatrist",
"basketball players": "Basketball players",
"Children": "Children",
"Friends": "Friends",
"Social Workers": "Social Workers",
"Unknown": "Unknown",
"neighbors": "Neighbors",
"Ex-Wife & Family": "Ex-Wife",
"Congresswoman": "Congresswoman",
"hunters": "hunters",
"Students+Parents": "Students",
"postmaster": "Postmaster",
"Teachers": "Teachers",
"Coworker's Family": "Coworker",
"Girlfriend": "Girlfriend",
"lawyers": "Lawyers",
"prayer group": "Prayer group",
"Family+students": "Students"}
)
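An alternative to this long literal map is a small keyword-based helper. The sketch below is illustrative only — its rules are my own assumptions and do not reproduce every grouping above (for instance, it folds 'Ex-Coworkers' into 'Coworkers'):

```python
import pandas as pd

def group_target(raw: str) -> str:
    """Map a raw Target string to a coarse group via keyword rules (illustrative)."""
    t = raw.strip().lower()
    if t.startswith("ex-") and "girlfriend" in t:
        return "Ex-girlfriend"
    if "coworker" in t:
        return "Coworkers"
    if "student" in t or "teacher" in t or "school" in t:
        return "Students"
    if "police" in t:
        return "Police"
    # fall back to a tidied version of the raw label
    return raw.strip().title()

targets = pd.Series(["Ex-GirlFriend", "coworkers", "Students+Teachers", "police", "random"])
print(targets.map(group_target))
```

The literal map is more transparent for a write-up, but the helper is easier to keep consistent when spelling variants keep appearing.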
#converting Date to date time format and extracting the features
df['Date']=pd.to_datetime(df['Date'])
df['Year']=df['Date'].dt.year
df['Month']=df['Date'].dt.month
df['Weekday']=df['Date'].dt.day #note: dt.day is the day of the month; dt.dayofweek would give the actual weekday
print(f"We have data from {df['Year'].min()} till {df['Year'].max()}.")
We have data from 1966 till 2017.
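Note that `dt.day` extracts the day of the month, not the weekday; the day of the week comes from `dt.dayofweek` or `dt.day_name()`. A small self-contained sketch of the difference:

```python
import pandas as pd

# Two dates in the dataset's M/D/Y format
dates = pd.to_datetime(pd.Series(["11/5/2017", "10/1/2017"]))
print(dates.dt.day)         # day of month: 5, 1
print(dates.dt.dayofweek)   # Monday=0 ... Sunday=6
print(dates.dt.day_name())  # both dates fall on a Sunday
```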
#finally saving the cleaned data as separate csv file
df.to_csv('cleaned_out.csv', index=False)
#importing cleaned data
df= pd.read_csv('cleaned_out.csv')
#reading head
df.head(2)
| Title | Location | Date | Incident Area | Open/Close Location | Target | Cause | Summary | Fatalities | Injured | ... | Latitude | Longitude | City | State | State_abb | Age Group | Target_group | Year | Month | Weekday | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Texas church mass shooting | Sutherland Springs, TX | 2017-11-05 | Church | Close | random | unknown | Devin Patrick Kelley, 26, an ex-air force offi... | 26 | 20 | ... | 29.273200 | -98.056400 | Sutherland Springs | Texas | TX | Young Adults | Random | 2017 | 11 | 5 |
| 1 | Walmart shooting in suburban Denver | Thornton, CO | 2017-11-01 | Wal-Mart | Open | random | unknown | Scott Allen Ostrem, 47, walked into a Walmart ... | 3 | 0 | ... | 39.903057 | -104.954445 | Thornton | Colorado | CO | Old Adults | Random | 2017 | 11 | 1 |
2 rows × 26 columns
#checking info
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 323 entries, 0 to 322 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Title 323 non-null object 1 Location 278 non-null object 2 Date 323 non-null object 3 Incident Area 261 non-null object 4 Open/Close Location 295 non-null object 5 Target 323 non-null object 6 Cause 323 non-null object 7 Summary 323 non-null object 8 Fatalities 323 non-null int64 9 Injured 323 non-null int64 10 Total victims 323 non-null int64 11 Policeman Killed 323 non-null float64 12 Age 323 non-null int64 13 Mental Health Issues 323 non-null object 14 Race 323 non-null object 15 Gender 323 non-null object 16 Latitude 323 non-null float64 17 Longitude 323 non-null float64 18 City 278 non-null object 19 State 323 non-null object 20 State_abb 321 non-null object 21 Age Group 323 non-null object 22 Target_group 323 non-null object 23 Year 323 non-null int64 24 Month 323 non-null int64 25 Weekday 323 non-null int64 dtypes: float64(3), int64(7), object(16) memory usage: 65.7+ KB
#converting Date column to DateTime format
df['Date']=pd.to_datetime(df['Date'])
df_plot= df.groupby('Year')[['Date']].count()
fig_1=px.bar(df_plot, x=df_plot.index, y='Date', labels={'Date': '# cases'})
fig_1.update_layout(title='US mass shooting incidences per year',
xaxis_title='Year', yaxis_title='Number of cases', xaxis = dict(
tickmode = 'linear', dtick = 2))
fig_1.update_xaxes(tickangle=30)
fig_1.show()
Next, let's analyse the 50 years of data in terms of fatalities, injured, and total victims per year:
df_plot=df.groupby(['Year']).agg({'Date': 'count', 'Fatalities': 'sum', 'Injured': 'sum', 'Total victims': 'sum'}).reset_index().sort_values('Year', ascending=True)
trace1 = go.Scatter(x=df_plot['Year'],
y=df_plot['Fatalities'],
mode='lines+markers',
line=dict(width=1.5),
name='Fatalities')
trace2 = go.Scatter(x = df_plot['Year'],
y = df_plot['Injured'],
mode='lines+markers',
line=dict(width=1.5),
name='Injured')
trace3 = go.Scatter(x = df_plot['Year'],
y = df_plot['Total victims'],
mode='lines+markers',
line=dict(width=1.5),
name='Total victims')
frames = [dict(data= [dict(type='scatter',
x=df_plot['Year'][:k+1],
y=df_plot['Fatalities'][:k+1]),
dict(type='scatter',
x=df_plot['Year'][:k+1],
y=df_plot['Injured'][:k+1]),
dict(type='scatter',
x=df_plot['Year'][:k+1],
y=df_plot['Total victims'][:k+1])],
traces= [0, 1, 2],
                   ) for k in range(1, len(df_plot))]
layout = go.Layout(#width=650,
#height=400,
title='History of US mass shooting in terms of fatalities/injured/total_victims',
xaxis_title='Year',
yaxis_title='Number of individuals affected',
showlegend=False,
hovermode='closest',
updatemenus=[dict(type='buttons', showactive=False,
y=1.05,
x=1.15,
xanchor='right',
yanchor='top',
pad=dict(t=0, r=10),
buttons=[dict(label='Play',
method='animate',
args=[None,
dict(frame=dict(duration=100,
redraw=False),
transition=dict(duration=8),
fromcurrent=True,
mode='immediate')])])])
layout.update(xaxis =dict(range=[1965, 2020]),
yaxis =dict(range=[0, 700]));
fig_2 = go.Figure(data=[trace1, trace2, trace3], frames=frames, layout=layout)
fig_2.update_layout(hovermode="x")
fig_2.show()
fig_3=px.scatter(df, y='Year', x='Month', size= 'Total victims', color='Gender',
hover_name="State", hover_data=["Fatalities", "Injured", "Policeman Killed"])
fig_3.update_layout(
title="Shooter's Gender and affected people in terms of total victims <br> (Hover over the markers to see more details)",
xaxis=dict(
title='Month',
tickmode='linear'),
yaxis=dict(
title='Year',
tickmode='linear',
dtick=2))
fig_3.show()
from plotly.subplots import make_subplots
fig_4 = make_subplots(rows=1, cols=3, shared_yaxes=True, subplot_titles=('Total victims', 'Injured', 'Fatalities'))
#make_subplots assigns each trace its axes via row/col, so no manual xaxis is needed
fig_4.add_trace(go.Heatmap(
    z=df['Total victims'],
    x=df['Month'],
    y=df['Year'], coloraxis='coloraxis'
    ), row=1, col=1)
fig_4.add_trace(go.Heatmap(
    z=df['Injured'],
    x=df['Month'],
    y=df['Year'], coloraxis='coloraxis3'
    ), row=1, col=2)
fig_4.add_trace(go.Heatmap(
    z=df['Fatalities'],
    x=df['Month'],
    y=df['Year'], coloraxis='coloraxis2'
    ), row=1, col=3)
fig_4.update_layout(
title='Total Victims per month in a year',
)
fig_4.update_layout(xaxis=dict(
title='Month',
tickmode='linear'),
yaxis=dict(
title='Year',
tickmode='linear',
dtick=2))
fig_4.update_layout(xaxis2=dict(
title='Month',
tickmode='linear'),
xaxis3=dict(
title='Month',
tickmode='linear')
)
fig_4.update_layout(
coloraxis=dict(colorscale='spectral', colorbar_y=0.25, colorbar_thickness=23, colorbar_len= 0.30, colorbar_title= 'Total victims' ),
coloraxis2=dict(colorscale='YlOrRd', colorbar_y=0.50, colorbar_thickness=23, colorbar_len=0.30, colorbar_title= 'Fatalities'),
coloraxis3=dict(colorscale='oryel', colorbar_y=0.75, colorbar_thickness=23, colorbar_len=0.30, colorbar_title= 'Injured'))
fig_4.show()
In recent years, mass shooting incidences have become common in almost all months.
The maximum number of victims was reported in October 2017; the second highest was reported in June 2016.
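Claims like this can be spot-checked with a groupby. A sketch on toy numbers (the real figures come from the dataset):

```python
import pandas as pd

# Hypothetical (Year, Month, victims) rows standing in for the real data
toy = pd.DataFrame({
    "Year":          [2017, 2016, 2015, 2017],
    "Month":         [10,   6,    12,   11],
    "Total victims": [585,  110,  35,   46],
})

# Sum victims per (Year, Month) and rank the worst months
worst = (toy.groupby(["Year", "Month"])["Total victims"]
            .sum()
            .sort_values(ascending=False))
print(worst.head(2))  # (2017, 10) first, (2016, 6) second
```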
df_plot=df.groupby('Month')[['Title']].count()
fig_5=px.bar(df_plot, x=df_plot.index, y='Title', labels = {'Title': 'Incidences'})
fig_5.update_layout(
title='US Mass Shooting - Total Incidences per Month',
xaxis=dict(
tickmode='linear', dtick=2))
fig_5.show()
It can be seen that February is the month with the highest number of mass shooting incidences recorded, and winter months in general show more cases.
Seasons have a demonstrated effect on mood, hostility, anger, irritability, and anxiety. With additional detailed data, further analysis could test whether there is any relationship between season and mass shootings.
fig_6=px.scatter(df, x='Injured', y='Fatalities', hover_name="Title", hover_data=["Date", "State", "Target_group"], color='Total victims')
fig_6.update_layout(
title='Fatalities Vs Injured')
fig_6.show()
The Las Vegas Strip mass shooting is the outlier here, with the number of injured being almost ten times the number of fatalities.
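Outliers like this can also be located programmatically, e.g. with `nlargest`. The column names follow the notebook, but the rows below are a toy stand-in for the real data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Title":      ["Las Vegas Strip mass shooting", "Texas church mass shooting"],
    "Fatalities": [58, 26],
    "Injured":    [515, 20],
})

# Row with the most injured, and its injured-to-fatalities ratio
top = toy.nlargest(1, "Injured")
ratio = top["Injured"].iloc[0] / top["Fatalities"].iloc[0]
print(top["Title"].iloc[0], round(ratio, 1))  # ~8.9 injured per fatality
```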
df_plot=df['Mental Health Issues'].value_counts(normalize=True)*100
fig_7=px.pie(df_plot, values=df_plot.values, names=df_plot.index, labels={'index':'Mental Health Issue'})
fig_7.update_layout(title='Percentage of Shootings with mental health issues')
fig_7.show()
32.8% of the incidences involved a shooter with a mental health condition, while 28.8% involved a shooter with no known mental health issue. For about 35% of the incidences, it is unknown whether the shooter had any mental health issue.
fig_8 = px.sunburst(df, path=['Race', 'Gender', 'Mental Health Issues'], values='Total victims', labels={'labels':'Mental Health Issues',
'id': 'Race'})
fig_8.update_layout(title= 'Shooter Race/Gender/Mental Health condition and victims involved in shooting')
fig_8.show()
fig_9 = go.Figure()
fig_9.add_trace(go.Scattergeo(
locationmode = 'USA-states',
# locations=df['State'],
text=[f'State: {x}; Fatalities: {y}; Target_group : {z}' for x,y, z in list(zip(df['State'], df['Fatalities'],df['Target_group'] ))] ,
lat= df['Latitude'],
lon= df['Longitude'],
marker = dict(
size = df['Fatalities']),
))
fig_9.update_layout(
title_text = 'US Mass shooting fatalities by states',
showlegend = True,
geo = dict(
scope = 'usa'
)
)
fig_9.show()
Nevada, Florida, Texas, California, Virginia, and Connecticut have the largest numbers of fatalities.
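The same ranking can be read off numerically with a groupby-sum. A minimal sketch with hypothetical rows (the notebook's `df` would be used in practice):

```python
import pandas as pd

toy = pd.DataFrame({
    "State":      ["Nevada", "Texas", "Nevada", "Florida"],
    "Fatalities": [59, 26, 1, 49],
})

# Total fatalities per state, top states first
by_state = toy.groupby("State")["Fatalities"].sum().nlargest(3)
print(by_state)  # Nevada leads with 60 in this toy data
```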
df_1=df.sort_values('Year', ascending=True)
fig_10= px.choropleth(df_1, locations='State_abb', locationmode="USA-states", color='Fatalities' #locations must come from df_1 itself, not df, since df_1 is re-sorted
,range_color=(min(df_1.Fatalities), max(df_1.Fatalities))
,hover_name="State",
color_continuous_scale=px.colors.sequential.Plasma,
animation_frame="Year",
scope="usa"
)
fig_10.update_layout(title='State wise fatalities from 1966 to 2017 <br>(Please select the Play button for animation)')
fig_10.show()
It can be seen that in recent years more states have reported mass shooting cases than ever before.
states_abb= {
'AL': 'Alabama',
'KY': 'Kentucky',
'OH': 'Ohio',
'AK': 'Alaska',
'LA': 'Louisiana',
'OK': 'Oklahoma',
'AZ': 'Arizona',
'ME': 'Maine',
'OR': 'Oregon',
'AR': 'Arkansas',
'MD': 'Maryland',
'PA': 'Pennsylvania',
'AS': 'American Samoa',
'MA': 'Massachusetts',
'PR': 'Puerto Rico',
'CA': 'California',
'MI': 'Michigan',
'RI': 'Rhode Island',
'CO': 'Colorado',
'MN': 'Minnesota',
'SC': 'South Carolina',
'CT': 'Connecticut',
'MS': 'Mississippi',
'SD': 'South Dakota',
'DE': 'Delaware',
'MO': 'Missouri',
'TN': 'Tennessee',
'DC': 'District of Columbia',
'MT': 'Montana',
'TX': 'Texas',
'FL': 'Florida',
'NE': 'Nebraska',
'GA': 'Georgia',
'NV': 'Nevada',
'UT': 'Utah',
'GU': 'Guam',
'NH': 'New Hampshire',
'VT': 'Vermont',
'HI': 'Hawaii',
'NJ': 'New Jersey',
'VA': 'Virginia',
'ID': 'Idaho',
'NM': 'New Mexico',
'VI': 'Virgin Islands',
'IL': 'Illinois',
'NY': 'New York',
'WA': 'Washington',
'IN': 'Indiana',
'NC': 'North Carolina',
'WV': 'West Virginia',
'IA': 'Iowa',
'ND': 'North Dakota',
'WI': 'Wisconsin',
'KS': 'Kansas',
'MP': 'Northern Mariana Islands',
'WY': 'Wyoming',
'TT': 'Trust Territories'
}
#creating a df for the dictionary
df2=pd.Series(states_abb).to_frame('Names').reset_index()
#saving the value count of df['State_abb'] in a df
df3=df['State_abb'].value_counts().to_frame('Incidences').reset_index()
#merging the two dataframe
df4=pd.merge(df2, df3, how='left', on='index')
df4
df4.fillna(0, inplace=True)
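The left-merge plus `fillna(0)` pattern used here generalises to any "which categories have zero observations" question. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Full category list (left side) and observed counts (right side)
all_states = pd.DataFrame({"index": ["CA", "WY"], "Names": ["California", "Wyoming"]})
counts     = pd.DataFrame({"index": ["CA"], "Incidences": [10]})

# Left merge keeps every category; unmatched rows get NaN, then 0
merged = pd.merge(all_states, counts, how="left", on="index")
merged["Incidences"] = merged["Incidences"].fillna(0)
print(merged.loc[merged["Incidences"] == 0, "Names"].tolist())  # ['Wyoming']
```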
fig_11= px.choropleth(df4,
locations='index',locationmode="USA-states", color='Incidences'
,hover_name="Names",
scope="usa"
)
fig_11.add_scattergeo(
locations=df4.loc[df4['Incidences']==0]['index'],
locationmode="USA-states",
mode='markers',
text='No Incidences reported',
marker=dict(size=12, symbol="diamond", line=dict(width=2, color="DarkSlateGrey"))
)
fig_11.update_layout(title='State wise incidences from 1966 to 2017 <br>(States with a diamond marker have no incidences of mass shooting)')
fig_11.show()
df.loc[df['Cause']=='domestic disputer', 'Cause']='domestic dispute'
df6=pd.crosstab(df['Cause'], df['Target_group'], values=df['Total victims'], aggfunc='sum').fillna(0)
#exclude the dominant 'Random' target group so the remaining cells are readable
df7=df6.loc[:, df6.columns != 'Random']
fig_12 = go.Figure(data=go.Heatmap(z=df7,
                                   y=df7.index,
                                   x=df7.columns,
                                   colorscale="Reds"))
fig_12.update_layout(title='Cause Vs Target Group <br> (Excluded the target_group Random as majority of cases involved a random target)')
fig_12.show()
#crosstab of shooter race against age group, computed once and reused
ct = pd.crosstab(df['Race'], df['Age Group'])
fig_13 = go.Figure(data=go.Heatmap(z=ct,
                                   y=ct.index,
                                   x=ct.columns,
                                   colorscale="Reds"))
fig_13.update_layout(title='Shooter Race Vs Age Group')
fig_13.show()